Module 05
\[ \sum_j p_j = 1 \]
\[ P[a \le x \le b] = \int_a^b f(x) \, dx \]
\[ \int_{-\infty}^{\infty} f(x)dx = 1 \]
Normal Distribution
Uniform Distribution
t Distribution
F Distribution
Chi-Square Distribution
Looks like?
The uniform distribution defines equal probability over a given range for a continuous distribution. For this reason, it is important as a reference distribution.
One of the most important applications of the uniform distribution is in the generation of random numbers. That is, almost all random number generators generate random numbers on the (0,1) interval. For other distributions, some transformation is applied to the uniform random numbers.
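This transformation idea is the probability integral transform: if \(U \sim \textrm{Uniform}(0,1)\), then \(F^{-1}(U)\) follows the distribution with CDF \(F\). A minimal sketch in base R, using the exponential distribution purely as an illustration:

```r
set.seed(42)

# Uniform random numbers on (0, 1) -- the raw output of most generators
u <- runif(10000)

# Transform to an exponential distribution (rate = 2) via its inverse CDF:
# F(x) = 1 - exp(-rate * x)  =>  F^-1(u) = -log(1 - u) / rate
x <- -log(1 - u) / 2

# The transformed values behave like draws from rexp(n, rate = 2),
# whose theoretical mean is 1 / rate = 0.5
mean(x)
```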
In theory, the t distribution requires multiple parameters to define it:
- location (mean)
- scale (standard deviation)
- shape (degrees of freedom, df)
But, we will always refer back to a “standardized” distribution curve centered at zero.
(Notice how quickly we approach a normal distribution.)
tibble(x = c(-4, 4)) %>%
ggplot() +
stat_function(mapping = aes(x = x),
fun = dt,
args = list(df = 2), color = "blue") +
stat_function(mapping = aes(x = x),
fun = dt,
args = list(df = 9), color = "green") +
stat_function(mapping = aes(x = x),
fun = dt,
args = list(df = 29), color = "red") +
stat_function(mapping = aes(x = x),
fun = dnorm,
args = list(mean = 0, sd = 1), color = "black") +
labs(
title = "t Distribution",
subtitle = "df = 2 (blue), df = 9 (green), df = 29 (red), normal (black)",
y = "Probability Density"
) +
theme_classic()

The t distribution is used in many cases to define the critical regions for hypothesis tests and to determine confidence intervals. The most common example is testing whether data are consistent with the assumed process mean.
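The convergence toward the normal curve can also be seen in the critical values themselves; for example, using the same degrees of freedom as the plot:

```r
# Two-tailed 95% critical values shrink toward the normal value as df grows
qt(0.975, df = 2)    # about 4.30
qt(0.975, df = 9)    # about 2.26
qt(0.975, df = 29)   # about 2.05
qnorm(0.975)         # about 1.96 (the normal limit)
```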
In theory, the chi-square distribution requires multiple parameters to define it (location, scale, and shape), but it is normally treated as a standardized distribution with location and scale omitted.
tibble(x = c(0, 20)) %>%
ggplot() +
stat_function(mapping = aes(x = x),
fun = dchisq,
args = list(df = 1), color = "blue") +
stat_function(mapping = aes(x = x),
fun = dchisq,
args = list(df = 2), color = "green") +
stat_function(mapping = aes(x = x),
fun = dchisq,
args = list(df = 5), color = "red") +
stat_function(mapping = aes(x = x),
fun = dchisq,
args = list(df = 10), color = "black") +
labs(
title = "Chi-Square Distribution",
subtitle = "df = 1 (blue), df = 2 (green), df = 5 (red), df = 10 (black)",
y = "Probability Density"
) +
theme_classic()

The chi-square distribution is used in many cases to define the critical regions for hypothesis tests and to determine confidence intervals.
Two common examples are the chi-square test for independence in an RxC contingency table and the chi-square test to determine if the standard deviation of a population is equal to a pre-specified value.
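As a quick sketch of the first example, `chisq.test()` runs the test for independence on a contingency table. The counts below are made up purely for illustration:

```r
# Hypothetical 2x2 contingency table: pass/fail counts for two production lines
counts <- matrix(c(45, 15,
                   30, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(line = c("Line A", "Line B"),
                                 result = c("Pass", "Fail")))

# Chi-square test for independence of line and result
chisq.test(counts)
```

A small p-value would suggest that the pass/fail outcome depends on the production line.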
The F distribution is the ratio of two chi-square distributions with degrees of freedom \(\nu_1\) and \(\nu_2\), respectively, where each chi-square has first been divided by its degrees of freedom.
My most common use of the F distribution is to calculate an upper control limit (UCL) for standard deviation run charts.
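The ratio definition can be checked by simulation: dividing two independent chi-square variates, each by its degrees of freedom, reproduces the F quantiles. This is only a sketch of the definition, not the UCL calculation itself:

```r
set.seed(1)
nu1 <- 10; nu2 <- 10

# Ratio of two independent chi-squares, each divided by its own df
f_sim <- (rchisq(1e5, df = nu1) / nu1) / (rchisq(1e5, df = nu2) / nu2)

# Simulated 95th percentile versus the theoretical F quantile
quantile(f_sim, 0.95)
qf(0.95, df1 = nu1, df2 = nu2)   # about 2.98
```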
tibble(x = c(0, 5)) %>%
ggplot() +
stat_function(mapping = aes(x = x),
fun = df,
args = list(df1 = 1, df2 = 1), color = "blue") +
stat_function(mapping = aes(x = x),
fun = df,
args = list(df1 = 1, df2 = 10), color = "green") +
stat_function(mapping = aes(x = x),
fun = df,
args = list(df1 = 10, df2 = 1), color = "red") +
stat_function(mapping = aes(x = x),
fun = df,
args = list(df1 = 10, df2 = 10), color = "black") +
labs(
title = "F Distribution",
subtitle = "df1 = 1, df2 = 1 (blue), df1 = 1, df2 = 10 (green), df1 = 10, df2 = 1 (red), df1 = 10, df2 = 10 (black)",
y = "Probability Density"
) +
theme_classic()

The F distribution is used in many cases to define the critical regions for hypothesis tests and to determine confidence intervals.
Two common examples are the analysis of variance and the F test to determine if the variances of two populations are equal.
Don’t get boxed into assuming the normal probability distribution is the only one; be aware that there are other distributions, even though we will use (and abuse) the normal distribution throughout.
A study of 66,831 dairy cows found the mean milk yield was 12.5 kg per milking with a standard deviation of 4.3 kg per milking.
What range represents 95% of the population?
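Assuming a normal distribution, the central 95% range follows directly from the z critical value (the variable names below are just for illustration):

```r
m <- 12.5   # mean milk yield (kg per milking)
s <- 4.3    # standard deviation (kg per milking)

z <- qnorm(0.975)      # two-tailed 95% critical value, about 1.96
m + c(-1, 1) * z * s   # roughly 4.1 to 20.9 kg per milking
```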
tibble(x = c(0, 25)) |>
ggplot() +
stat_function(mapping = aes(x = x),
fun = dnorm,
args = list(mean = 12.5, sd = 4.3),
color = "blue") +
stat_function(mapping = aes(x = x),
fun = dnorm,
args = list(mean = 12.5, sd = 4.3),
xlim = c(12.5 - 1.96*(4.3), 12.5 + 1.96*(4.3)),
geom = "area",
fill = "#000099", alpha = 0.2) +
labs(
title = "Milk production with 95% of population shaded \nassuming a normal distribution",
subtitle = "mean = 12.5, sd = 4.3, z(critical) = ±1.96",
y = "Probability Density"
) +
theme_classic()

Suppose that instead of 66,831 dairy cows we only had 6, but with the same descriptive statistics.
What would our probability distribution and 95% probability of data look like?
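With only 6 cows (df = 5), the two-tailed 95% critical value widens noticeably; it can be computed directly:

```r
n <- 6
qt(0.975, df = n - 1)   # about 2.57, versus 1.96 for the normal distribution
```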
tibble(x = c(-3, 3)) |>
ggplot() +
stat_function(mapping = aes(x = x),
fun = dnorm,
color = "black") +
stat_function(mapping = aes(x = x),
fun = dt,
args = list(df = 5),
color = "blue") +
stat_function(mapping = aes(x = x),
fun = dt,
args = list(df = 5),
xlim = c(-2.57, 2.57),
geom = "area",
fill = "#000099", alpha = 0.2) +
labs(
title = "Milk production with 95% of population shaded \nfor 6 cows with t distribution",
subtitle = "t distribution, t(critical) = ±2.57 \nblack curve: normal distribution",
y = "Probability Density"
) +
theme_classic()

Create plots for normal, t, and F distributions using ggplot’s stat_function() layer.
Provide a brief description of each distribution and an example of how it is used. Include descriptive titles, subtitles, and axis labels.
For each distribution, provide the critical value(s) used to estimate 95% probability. (Only one example is needed if multiple distributions are possible.) Describe if you are using a “one-tailed” or “two-tailed” value.
Accuracy is a qualitative term referring to whether there is agreement between a measurement made on an object and its true (target or reference) value.
Bias is a quantitative term describing the difference between the average of measurements made on the same object and its true value. In particular, for a measurement laboratory, bias is the difference (generally unknown) between a laboratory’s average value (over time) for a test item and the average that would be achieved by the reference laboratory if it undertook the same measurements on the same test item.
Low Accuracy
Low Precision
Low Accuracy
High Precision
High Accuracy
Low Precision
High Accuracy
High Precision
What would this look like?
set.seed(2020)
label_string <- as_labeller(c(process_1 = "Process 1", process_2 = "Process 2"))
my_dnorm_function <- function(x) {
dnorm(seq(40, 160, 1), mean = x, sd = 10)
}
process_simulation <- tibble(
day = rep(1:10),
process_1 = rnorm(10, mean = 100, sd = 2),
process_2 = rnorm(10, mean = 120, sd = 15)
) |>
pivot_longer(2:3, names_to = "mean_process", values_to = "mean_values") |>
mutate(simulated_distribution = map(mean_values, my_dnorm_function)) |>
unnest(cols = c(simulated_distribution)) |>
mutate(x_values = rep(seq(40, 160, 1), times = 20))
process_simulation |>
ggplot(aes(x = x_values, y = day, height = simulated_distribution*100, fill = as.factor(day))) +
facet_wrap(vars(mean_process), labeller = label_string) +
geom_ridgeline(alpha = 0.8) +
geom_vline(xintercept = 100, color = "black") +
geom_vline(xintercept = c(70, 130), color = "red", linetype = "dashed") +
labs(
x = "Process Value",
y = "Distributions Over Time",
fill = "Day",
title = "Distributions of short-term measurements over 10 days",
subtitle = "Distances from the centerline illustrate between-day variability"
) +
guides(fill = guide_legend(reverse = TRUE)) +
scale_fill_brewer(palette = "Spectral") +
theme_classic() +
theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())

. . . and provide Best Known Methods (BKMs) or Standard Operating Procedures (SOPs) to ensure consistent implementation.
As an example, the NIST e-Handbook provides a detailed description of these terms.
For \(J\) measurements on each of \(K\) days, the NIST e-Handbook defines the short-term variability as the standard deviation of the measurements within the \(k\)th day:
\[ {\large s}_k = \sqrt{\frac{1}{J-1} \sum_{j=1}^{J} ( Y_{kj} - \overline{Y}_{k \, \small{\bullet}} ) ^2} \]
where:
\[ \overline{Y}_{k \, \small{\bullet}} = \frac{1}{J}\sum_{j=1}^{J} Y_{kj} \] is the mean of the \(k\)th day, and \(Y_{kj}(k=1, \,\ldots, \, K, \,\, j=1, \,\ldots, \, J)\) are the individual measurements.
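These summaries can be computed with a group-by. The sketch below re-creates a level_1_kj data frame with the column names used in the plot code that follows; the simulated data and the sizes (J = 10 measurements on each of K = 15 days) are assumptions consistent with the plot caption rnorm(150, mean = 100, sd = 3):

```r
library(dplyr)
library(tibble)

set.seed(123)
J <- 10; K <- 15   # assumed sizes: 10 measurements on each of 15 days

level_1_kj <- tibble(
  day_k = rep(1:K, each = J),
  y_kj  = rnorm(J * K, mean = 100, sd = 3)
) |>
  group_by(day_k) |>
  summarise(mean_k = mean(y_kj),   # daily mean, Y-bar_k.
            sd_k   = sd(y_kj))     # level-1 (within-day) standard deviation
```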
level_1_kj |>
ggplot(aes(x = day_k, y = mean_k, ymin = mean_k - 2*sd_k, ymax = mean_k + 2*sd_k)) +
geom_point() +
geom_errorbar(width = 0.2) +
labs(
x = "Day",
y = "Mean Measurement",
title = "Short-term Variability: Precision",
subtitle = "Error bars represent \U00B1two standard deviations",
# subtitle = TeX(r"(\textrm{Error bars represent }$\pm$\textrm{two standard deviations})"),
caption = "Data generated using rnorm(150, mean = 100, sd = 3)"
) +
theme_classic()

level_1_2_kj |>
ggplot(aes(x = day_k, y = mean_k, ymin = mean_k - 2*sd_k, ymax = mean_k + 2*sd_k)) +
geom_point() +
geom_errorbar(width = 0.2) +
geom_hline(yintercept = c(94, 106), color = "red", linetype = "dashed") +
labs(
x = "Day",
y = "Mean Measurement",
title = "Short-term Variability: Precision",
subtitle = "Error bars represent \U00B1two standard deviations",
# subtitle = TeX(r"(\textrm{Error bars represent }$\pm$\textrm{two standard deviations})"),
caption = "Data generated using rnorm(900, mean = 100, sd = 3)"
) +
theme_classic()

Level-2 Standard Deviation
\[ {\large s}_2 = \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} \left( \overline{Y}_{k \, \small{\bullet}} - \overline{Y}_{\small{\bullet} \small{\bullet}} \right) ^2} \]
where:
\[ \overline{Y}_{\small{\bullet} \small{\bullet}} = \frac{1}{K} \sum_{k=1}^{K} \overline{Y}_{k \, \small{\bullet}} \]
is the grand mean of all measurements, and \(\overline{Y}_{k \, \small{\bullet}}\) is the mean of the \(k\)th day.
Let’s assume our day-to-day variability follows a random walk: starting from our target value of 100, each day we add a random value drawn from a normal distribution with a mean of 0 and a standard deviation of 1.
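The day_to_day data frame itself is not shown; a minimal construction consistent with the plot caption below would be (the seed value is an assumption):

```r
set.seed(2020)  # assumed seed, for reproducibility only

# Random walk: target of 100 plus a cumulative sum of N(0, 1) daily steps
day_to_day <- data.frame(
  day = 1:30,
  measurement = 100 + cumsum(rnorm(30, mean = 0, sd = 1))
)
```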
day_to_day |>
ggplot(aes(x = day, y = measurement)) +
geom_line() +
geom_point() +
geom_errorbar(aes(ymin = measurement - 6, ymax = measurement + 6), width = 0.2) +
geom_hline(yintercept = c(94, 106), color = "red", linetype = "dashed") +
labs(
x = "Day",
y = "Measurement",
title = "Day-to-Day Variability: Repeatability",
caption = "Data generated using cumsum(rnorm(30, mean = 0, sd = 1))"
) +
theme_classic()

Level-3 Standard Deviation

\[ {\large s}_{3} = \sqrt{\frac{1}{L-1} \sum_{l=1}^{L}{\left( \overline{Y}_{l{\small \, \bullet \bullet}} - \overline{Y}_{{\small \bullet \bullet \bullet}} \right)^2}} \]

where

\[ \overline{Y}_{{\small \bullet \bullet \bullet}} = \frac{1}{L}\sum_{l=1}^{L}{\overline{Y}_{l {\small \, \bullet \bullet}}} \]

is the grand mean of all measurements, and \(\overline{Y}_{l \, \small{\bullet \bullet}}\) is the mean of the \(l\)th run.
day_to_day_1000 |>
ggplot(aes(x = day, y = measurement)) +
geom_line() +
geom_errorbar(aes(ymin = measurement - 6, ymax = measurement + 6), width = 0.0) +
geom_hline(yintercept = c(94, 106), color = "red", linetype = "dashed") +
labs(
x = "Day",
y = "Measurement",
title = "Day-to-Day Variability: Random Walk over 365 \"days\"",
caption = "Data generated using cumsum(rnorm(365, mean = 0, sd = 1))"
) +
theme_classic()

day_to_day_1000 |>
filter(day <= 90) |>
ggplot(aes(x = day, y = measurement)) +
geom_line() +
geom_errorbar(aes(ymin = measurement - 6, ymax = measurement + 6), width = 0.0) +
geom_hline(yintercept = c(94, 106), color = "red", linetype = "dashed") +
labs(
x = "Day",
y = "Measurement",
title = "Day-to-Day Variability: Random Walk over 90 \"days\"",
caption = "Data generated using cumsum(rnorm(365, mean = 0, sd = 1))"
) +
theme_classic()

days_to_keep <- c(1:7, 31:37, 61:67)
day_to_day_1000 |>
filter(day %in% days_to_keep) |>
ggplot(aes(x = day, y = measurement)) +
geom_point() +
geom_errorbar(aes(ymin = measurement - 6, ymax = measurement + 6), width = 0.0) +
geom_hline(yintercept = c(94, 106), color = "red", linetype = "dashed") +
labs(
x = "Day",
y = "Measurement",
title = "Day-to-Day Variability: Random Walk over 90 \"days\"",
caption = "Data generated using cumsum(rnorm(365, mean = 0, sd = 1))"
) +
theme_classic()

days_to_keep <- c(1:7, 31:37, 61:67)
day_to_day_1000 |>
filter(day %in% days_to_keep) |>
mutate(run_id = case_when(
day <= 7 ~ "Run 1",
day > 30 & day <= 37 ~ "Run 2",
TRUE ~ "Run 3")
) |>
group_by(run_id) |>
summarise(mean_measurement = mean(measurement)) |>
summarise(count = n(), sd_measurement = sd(mean_measurement))

\[ \begin{aligned} H_0 & : \textrm{a null hypothesis} \\ H_a & : \textrm{an alternative hypothesis} \end{aligned} \]
We do not prove the alternative hypothesis, we only reject or fail to reject the null hypothesis.
The following numbers are particle (contamination) counts for a sample of 10 semiconductor silicon wafers:
50 48 44 56 61 52 53 55 67 51
The mean = 53.7 counts and the standard deviation = 6.567 counts.
Over a long run the process average for wafer particle counts has been 50 counts per wafer, and on the basis of the sample, we want to test whether a change has occurred.
What is the null hypothesis and alternative hypothesis?
Is this a one-sided or two-sided test?
What test would you use?
\[ t = \frac{\overline{Y} - \mu_0}{s/\sqrt{n}} \]
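Plugging the wafer data into this formula reproduces the summary statistics quoted above (t_stat is just an illustrative name):

```r
counts <- c(50, 48, 44, 56, 61, 52, 53, 55, 67, 51)
mu_0 <- 50   # long-run process average

# One-sample t statistic: (sample mean - hypothesized mean) / standard error
t_stat <- (mean(counts) - mu_0) / (sd(counts) / sqrt(length(counts)))
t_stat              # about 1.78
qt(0.975, df = 9)   # two-sided 5% critical value, about 2.26
```

Since 1.78 falls inside ±2.26, we fail to reject \(H_0\) at the 5% level.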
tibble(x = c(-3, 3)) |>
ggplot() +
stat_function(mapping = aes(x = x),
fun = dt,
args = list(df = 9),
color = "blue") +
stat_function(mapping = aes(x = x),
fun = dt,
args = list(df = 9),
xlim = c(-2.262, 2.262),
geom = "area",
fill = "#000099", alpha = 0.2) +
geom_vline(xintercept = t_test, color = "black") +
annotate("label", x = t_test, y = 0.3, label = paste0("t-test \ncalculated value\n", round(t_test, 3)), color = "black") +
labs(
title = "t-distribution with 95% of population shaded",
subtitle = "t distribution, t(critical) = ±2.262",
y = "Probability Density"
) +
theme_classic()

Set up the null and alternative hypotheses.
\[ \begin{aligned} H_0 & : \mu_1 = \mu_2 \\ H_a & : \mu_1 \neq \mu_2 \end{aligned} \]
For a one-sided test
\[ \begin{aligned} H_0 & : \mu_1 \le \mu_2 \\ H_a & : \mu_1 > \mu_2 \\ \textrm{or} \\ H_0 & : \mu_1 \ge \mu_2 \\ H_a & : \mu_1 < \mu_2 \end{aligned} \]
\[ \overline{Y1} = \frac{1}{n_1} \sum_{i=1}^{n_1} Y1_i \, , \,\,\,\,\, \overline{Y2} = \frac{1}{n_2} \sum_{i=1}^{n_2} Y2_i \]
\[ s_1 = \sqrt{\frac{1}{n_1-1} \sum_{i=1}^{n_1} (Y1_i - \overline{Y1})^2} \, , \,\,\,\,\, s_2 = \sqrt{\frac{1}{n_2-1} \sum_{i=1}^{n_2} (Y2_i - \overline{Y2})^2} \]
The type of t-test used will depend on whether the variances are equal or not.
\[ t = \frac{\overline{Y1} - \overline{Y2}}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \]
where
\[ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]
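With the assembly-time data from the table above, the pooled statistic can be reproduced directly (the vector names are illustrative; the values are the Process 1 and Process 2 columns):

```r
process_1_old <- c(32, 37, 35, 28, 41, 44, 35, 31, 34, 38, 42)
process_2_new <- c(36, 31, 30, 31, 34, 36, 29, 32, 31)

n1 <- length(process_2_new); s1 <- sd(process_2_new)
n2 <- length(process_1_old); s2 <- sd(process_1_old)

# Pooled standard deviation
s_p <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))

# Pooled (equal-variance) t statistic, new minus old
t_stat <- (mean(process_2_new) - mean(process_1_old)) /
  (s_p * sqrt(1 / n1 + 1 / n2))
t_stat   # about -2.135

# Equivalent built-in call
t.test(process_2_new, process_1_old, var.equal = TRUE)
```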
\[ t = \frac{\overline{Y1} - \overline{Y2}}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}} \]
In this case, the degrees of freedom are not known exactly but can be estimated using the Welch-Satterthwaite equation.
\[ \nu = \frac{\left( \frac{s_1^2}{N_1} + \frac{s_2^2}{N_2} \right)^2} {\frac{s_1^4}{N_1^2(N_1-1)} + \frac{s_2^4}{N_2^2(N_2-1)}} \]
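Applying the Welch-Satterthwaite equation to the same assembly-time data reproduces the fractional degrees of freedom reported by the Welch test output below:

```r
process_1_old <- c(32, 37, 35, 28, 41, 44, 35, 31, 34, 38, 42)
process_2_new <- c(36, 31, 30, 31, 34, 36, 29, 32, 31)

n1 <- length(process_2_new); s1 <- sd(process_2_new)
n2 <- length(process_1_old); s2 <- sd(process_1_old)

# Welch-Satterthwaite estimate of the degrees of freedom
nu <- (s1^2 / n1 + s2^2 / n2)^2 /
  (s1^4 / (n1^2 * (n1 - 1)) + s2^4 / (n2^2 * (n2 - 1)))
nu   # about 15.5
```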
Use the t.test() function from the stats package in R.
?t.test
A new procedure (process 2) to assemble a device is introduced and tested for possible improvement in time of assembly. The question being addressed is whether the mean, \(\mu_2\), of the new assembly process is smaller than the mean, \(\mu_1\), of the old assembly process (process 1).
| Device | Process 1 (Old) | Process 2 (New) |
|---|---|---|
| 1 | 32 | 36 |
| 2 | 37 | 31 |
| 3 | 35 | 30 |
| 4 | 28 | 31 |
| 5 | 41 | 34 |
| 6 | 44 | 36 |
| 7 | 35 | 29 |
| 8 | 31 | 32 |
| 9 | 34 | 31 |
| 10 | 38 | |
| 11 | 42 | |
Two Sample t-test
data: process_df$Process_2_New and process_df$Process_1_Old
t = -2.1353, df = 18, p-value = 0.04673
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.67502857 -0.06234517
sample estimates:
mean of x mean of y
32.22222 36.09091
[1] "Process 1 (Old) Mean: 36.09"
[1] "Process 1 (Old) Standard Deviation: 4.91"
[1] "Process 2 (New) Mean: 32.22"
[1] "Process 2 (New) Standard Deviation: 2.54"
\[ \begin{aligned} H_0 & : \sigma_1 = \sigma_2 \\ H_a & : \sigma_1 \neq \sigma_2 \end{aligned} \]
or for a one-sided test
\[ \begin{aligned} H_0 & : \sigma_1 \le \sigma_2 \\ H_a & : \sigma_1 > \sigma_2 \end{aligned} \]
For our case, we can state the null hypothesis as
\[ \begin{aligned} H_0 & : \sigma_{new} \ge \sigma_{old} \\ H_a & : \sigma_{new} < \sigma_{old} \end{aligned} \]
F test to compare two variances
data: process_df$Process_2_New and process_df$Process_1_Old
F = 0.26751, num df = 8, denom df = 10, p-value = 0.03705
alternative hypothesis: true ratio of variances is less than 1
95 percent confidence interval:
0.0000000 0.8953837
sample estimates:
ratio of variances
0.2675052
Welch Two Sample t-test
data: process_df$Process_2_New and process_df$Process_1_Old
t = -2.2694, df = 15.533, p-value = 0.03788
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-7.4914303 -0.2459435
sample estimates:
mean of x mean of y
32.22222 36.09091
. . . study the effect of temperature on a passive component such as a resistor. We select three different temperatures and observe their effect on the resistors. The experiment can be conducted by measuring all the participating resistors before placing them in three different ovens.
Each oven is heated to a selected temperature. Then we measure the resistors again after, say, 24 hours and analyze the responses, which are the differences between before and after being subjected to the temperatures.
The temperature is called a factor. The different temperature settings are called levels. In this example there are three levels or settings of the factor Temperature.
A factor is an independent treatment variable whose settings (values) are controlled and varied by the experimenter. The intensity setting of a factor is the level.
Levels may be quantitative numbers or, in many cases, simply “present” or “not present” (“0” or “1”).
In this experiment, there is only one factor, temperature, and the analysis is called a one-way or one-factor ANOVA.
For the one-way ANOVA, the null hypothesis is that all the population means are equal; the alternative hypothesis is that at least one population mean is different from the others.
\[ \begin{aligned} H_0 & : \mu_1 = \mu_2 = \ldots = \mu_k \\ H_a & : \textrm{at least one } \mu_i \textrm{ is different} \end{aligned} \]
where \(k\) is the number of levels of the factor.
\[ \begin{array}{ccccc} SS(Total) & = & SST & + & SSE \\ & & & & \\ \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{\huge{\cdot \cdot}})^2 & = & \sum_{i=1}^k n_i (\bar{y}_{i \huge{\cdot}} - \bar{y}_{\huge{\cdot \cdot}})^2 & + & \sum_{i=1}^k \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_{i \huge{\cdot}})^2 \end{array} \]
where \(k\) is the number of treatments and \(n_i\) is the number of observations in the \(i\)th treatment.
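The identity SS(Total) = SST + SSE can be verified numerically on any one-factor data set; a minimal sketch with simulated data (three groups of five observations):

```r
set.seed(7)

# Three treatment groups (k = 3) with 5 observations each
y     <- c(rnorm(5, 10), rnorm(5, 12), rnorm(5, 15))
group <- factor(rep(c("A", "B", "C"), each = 5))

grand_mean   <- mean(y)
fitted_means <- ave(y, group)   # each value replaced by its group mean

ss_total <- sum((y - grand_mean)^2)
sst      <- sum((fitted_means - grand_mean)^2)   # between-treatment SS
sse      <- sum((y - fitted_means)^2)            # within-treatment SS

all.equal(ss_total, sst + sse)   # TRUE
```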
The calculations are conventionally summarized in an ANOVA (Analysis of Variance) table, with the components labeled Total SS, SST, and SSE:
| Source of Variation | Sum of Squares | Degrees of Freedom | Mean Square | F-ratio |
|---|---|---|---|---|
| Between Groups (SST) | SST | k - 1 | MST = SST / (k - 1) | F = MST / MSE |
| Within Groups (SSE) | SSE | N - k | MSE = SSE / (N - k) | |
| Total (Total SS) | Total SS | N - 1 | | |
Df Sum Sq Mean Sq F value Pr(>F)
Level 2 27.90 13.949 9.591 0.00325 **
Residuals 12 17.45 1.454
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
data_long |>
ggplot(aes(x = Level, y = Value)) +
geom_point(position = position_jitter(width = 0.05), size = 3) +
labs(
x = "Level",
y = "Value",
title = "One-way ANOVA Example from NIST e-Handbook",
caption = "https://www.itl.nist.gov/div898/handbook/prc/section4/prc433.htm"
) +
theme_classic()

greenhouse_df_2 <- read_table("fert species resp
control SppA 21.0
control SppA 19.5
control SppA 22.5
control SppA 21.5
control SppA 20.5
control SppA 21.0
control SppB 23.7
control SppB 23.8
control SppB 23.8
control SppB 23.7
control SppB 22.8
control SppB 24.4
f1 SppA 32.0
f1 SppA 30.5
f1 SppA 25.0
f1 SppA 27.5
f1 SppA 28.0
f1 SppA 28.6
f1 SppB 30.1
f1 SppB 28.9
f1 SppB 30.9
f1 SppB 34.4
f1 SppB 32.7
f1 SppB 32.7
f2 SppA 22.5
f2 SppA 26.0
f2 SppA 28.0
f2 SppA 27.0
f2 SppA 26.5
f2 SppA 25.2
f2 SppB 30.6
f2 SppB 31.1
f2 SppB 28.1
f2 SppB 34.9
f2 SppB 30.1
f2 SppB 25.5
f3 SppA 28.0
f3 SppA 27.5
f3 SppA 31.0
f3 SppA 29.5
f3 SppA 30.0
f3 SppA 29.2
f3 SppB 36.1
f3 SppB 36.6
f3 SppB 38.7
f3 SppB 37.1
f3 SppB 36.8
f3 SppB 37.1
", col_names = TRUE)
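These data have two factors (fert with four levels and species with two), so a natural analysis, sketched below with base R's aov(), is a two-way ANOVA with interaction. The data frame is re-created compactly here, with the same 48 values as above, so the block is self-contained:

```r
# Same greenhouse data as above, built compactly with base R
greenhouse_df_2 <- data.frame(
  fert    = rep(c("control", "f1", "f2", "f3"), each = 12),
  species = rep(rep(c("SppA", "SppB"), each = 6), times = 4),
  resp    = c(21.0, 19.5, 22.5, 21.5, 20.5, 21.0,   # control SppA
              23.7, 23.8, 23.8, 23.7, 22.8, 24.4,   # control SppB
              32.0, 30.5, 25.0, 27.5, 28.0, 28.6,   # f1 SppA
              30.1, 28.9, 30.9, 34.4, 32.7, 32.7,   # f1 SppB
              22.5, 26.0, 28.0, 27.0, 26.5, 25.2,   # f2 SppA
              30.6, 31.1, 28.1, 34.9, 30.1, 25.5,   # f2 SppB
              28.0, 27.5, 31.0, 29.5, 30.0, 29.2,   # f3 SppA
              36.1, 36.6, 38.7, 37.1, 36.8, 37.1))  # f3 SppB

# Two-way ANOVA: main effects for fert and species plus their interaction
fit <- aov(resp ~ fert * species, data = greenhouse_df_2)
summary(fit)
```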
greenhouse_df_2

Applied Statistical Techniques
Comments on the Normal Distribution
NIST e-Handbook
For both theoretical and practical reasons, the normal distribution is probably the most important distribution in statistics. For example,
Many classical statistical tests are based on the assumption that the data follow a normal distribution. This assumption should be tested before applying these tests.
In modeling applications, such as linear and non-linear regression, the error term is often assumed to follow a normal distribution with fixed location and scale.
The normal distribution is used to find significance levels in many hypothesis tests and confidence intervals.
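For example, the normal critical values behind the common significance levels come straight from qnorm():

```r
qnorm(0.975)   # two-sided 5% level, about 1.96
qnorm(0.95)    # one-sided 5% level, about 1.64
qnorm(0.995)   # two-sided 1% level, about 2.58
```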